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Fast Multiplication of Matrices with Decay 
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A fast algorithm for the approximate multiplication of matrices with decay is introduced; the 
Sparse Approximate Matrix Multiply (SpAMM) reduces complexity in the product space, a 
different approach from current methods that economize within the matrix space through truncation 
or rank reduction. Matrix truncation (element dropping) is compared to SpAMM for quantum 
chemical matrices with approximate exponential and algebraic decay. For matched errors 
in the electronic total energy, SpAMM is found to require fewer to far fewer floating point 
operations relative to dropping. The challenges and opportunities afforded by this new 
approach are discussed, including the potential for high performance implementations. 



INTRODUCTION 

For large dense linear algebra problems, the compu- 
tational advantage offered by fast matrix- matrix multi- 
plication can be substantial, even with seemingly small 
gains in asymptotic complexity. Relative to conven- 
tional multiplication which is 0(n 3 ), Strassen's algo- 
rithm achieves (9(n 2,8 ), while the Coppersmith and 
Winograd method is 0(n 2 38 ). For these dense meth- 
ods, balancing the trade off between cost, complexity and 
error is an active area of research [TH3]. 

On the other hand, large sparse problems are typ- 
ically handled with conventional sparse matrix tech- 
niques, with only small concessions between multiplica- 
tion algorithms. Intermediate to these regimes, a wide 
class of problems exist that involve matrices with de- 
cay 1 where sparsity exists only asymptotically under an 
approximate linear algebra, historically involving matrix 
economization through element dropping or rank reduc- 
tion. Often, problems with decay occur in the construc- 
tion of matrix functions, notably the matrix inverse [6], 
the matrix exponential [7J, and in the case of electronic 
structure theory, the Heaviside step function [4j |5j EJ |9] . 
The use of an approximate matrix algebra is also an ac- 
tive area of interest in the solution of large eigenproblems 

[ME]. 

Many approaches to a sparse approximate linear alge- 
bra exist for matrices with decay fT4HT7) . largely predi- 
cated upon the truncation of matrix elements, with the 
recent work of Benzi providing the most detailed analy- 
sis so far [U[5]. In this contribution, we develop sparse 
matrix multiplication as a generalized A^-body problem 



A matrix A is said to decay when its matrix elements de- 
crease exponentially, as \aij\ < c\\ l ~ ^ , or algebraically as 
\ a i,j \ < | ._ .? A , with indicial separation \i— j\. In non-synthetic 
cases, the separation \i — j\ typically corresponds to an under- 
lying physical distance |rj — rj\, e.g. of basis functions, finite 
elements, etc. See Figure [3] as well as the excellent work by 
Benzi and co-authors on this topic in References 0][5]. 



[T8] . and introduce a fast algorithm based on hierarchical 
truncation in the three-dimensional space z, j, k G 
of the product Qj = A^B^, where A and B decay 
exponentially or algebraically fast enough 2 . Viewing the 
product from a length scale perspective 1 , if matrix ele- 
ments decay as 0(l/r A ), then the bulk of the product 
interactions will decay as 0(l/r 2X ). For small A, the dif- 
ference between truncation in the matrix space and the 
product space may be significant. 

A SPARSE APPROXIMATE MATRIX MULTIPLY 

The quadtree matrix representation, 

^ = ( Ak+l Ak+l ) i k = 0? * * * J ^max, (1) 

is the basis for recursive matrix-matrix multiplication, 
C k = A k • B k . For conventional recursive multiplication, 
the operator "•" is just the row-column product, while in 
fast multiplication, it represents an economized sequence 
of operations with reduced complexity and a more com- 
plicated error accumulation. In Reference [19J, Bini and 
Lotti carried out a detailed error analysis for recursive 
matrix multiplication schemes, and derived component- 
wise bounds of the form 

\cij - Cij \ < aben\g 2 n (2) 

where clj is a matrix element computed to within preci- 
sion e, Cij is its exact counterpart, a = max^ \a^\ and 
b = max^j \bij\. While providing a sharp bound, the 
max norm does not immediately lend itself to the re- 
cursive separation of interaction magnitudes. Consider 



2 Algebraic decay sufficient to achieve a fast O (n lg n) or better 
complexity is an open question similar to that of conditional 
convergence in the shape dependent summation of dipole and 
quadrupole components of the Lorentz field. 
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instead the framework provided by an arbitrary sub- 
multiplicative matrix norm || • ||: 
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This structure suggests an algorithm we call the Sparse 
Approximate Matrix Multiply (SpAMM), which recur- 
sively tests each of the 8 contributions in Equation ^ for 
significance in accordance with a given numerical thresh- 
old r: 



SpAMM(A /c , B k ) 



( SpAMM(^+ 1 ,5 1 Y 1 ) 
+SpAMM(^+ 1 ,^+ 1 ) 





A k . B k 

SpAMM(^+ 1 ,5 1 fc 2 +1 ) \ 
+SpAMM(^+\^ 2 +1 ) 



SpAMM^ 1 , SpAMM(^+\ 



\+SpAMM(^+ i ,^+ i ) 



+SpAMM(4+ 1 ,^+ 1 ) J 



if 
elseif 

else 



B k \\ 
k — kr 



< T 



(4) 



r 



The truncated product space accessed by SpAMM is 
shown in Figure [I] for matrices with exponential and al- 
gebraic decay. 



bound for recursive multiplication employed by Demmel 
and co-workers in Reference |2J: 




Figure 1: Hierarchical truncation of the product space (i, j, k) 
using r = 10 ~ 8 for synthetic matrices of dimension n — 512, 



with Aij = exp (— \z — j\) and Bij = exp (—2 \i 
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At each tier in SpAMM, the local truncation error is 
bounded by \\C k — C k \\ < r, with an error accumulation 
structurally similar to rounding in conventional recur- 
sive multiplication, except that truncation is not guaran- 
teed to occur at each tier in the recursion, and trunca- 
tion does not retain even the approximate magnitude of 
the avoided sub-product. To see the difference between 
truncation and round-off, consider the generic, norm- wise 



||C-C|| <Mn) + O (e 2 ) 



(5) 



with a low order polynomial 



and d > 1. Un- 



like rounding which is an error of commission, SpAMM 
creates errors of omission. If r < ||A|| then no work 
is performed and we obviously have||C — C\\ < r, which is 
different than \\C — C\\ < e||A||||£?|| corresponding to the 
case of comparable round-off and truncation parameters 
e ~ r. One could certainly bound SpAMM rigorously 



by decreasing r with increasing depth, r 



fc+i _ 



78, but 



that would be overly pessimistic, not taking into account 
signed error accumulation or the localization and atten- 
uation of errors due to decay. 

SpAMM is similar to the H-matrix algebra of Hack- 
busch and co-workers, where off-diagonal sub-matrices 
are treated as reduced rank factorizations (truncated 
SVD), typically structured and grouped to reflect prop- 
erties of the underlying operators [20J . For problems with 
rapid decay, truncated SVD may behave in a similar way 
to simple dropping schemes. SpAMM is different than 
the ^-matrix algebra as it achieves separation uniquely 
in the product space and does not rely on a reduced com- 
plexity representation of matrices. For very slow decay, 
the H-matrix algebra may certainly offer a path for in- 
tractable problems for which SpAMM is ineffective. 
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Figure 2: Space filling curves map atoms in Cartesian space 
(A) onto an index that is locality preserving (B), leading to 
clustering of matrix elements with respect to indicial separa- 
tion (C) when interactions are short ranged. The octree gen- 
erated by SpAMM in the three-dimensional product space, 
shown in Figure [I] is equivalent to a second space filling 
curve (spatial hashing) [21-23] that orders both matrices (C) 
and the product space (D), features that can be exploited to 
achieve domain decomposition and load balance (colors in C 
and D). 

An understanding of the error accumulation in 
SpAMM must account for the decay properties of A and 
B, which in non- synthetic cases are intimately related to 
the effects of ordering and structure of the underlying 
physical, chemical or engineering application. Also, or- 
dering will determine the relative efficiencies of both ma- 
trix truncation and SpAMM, particularly under block- 
ing. The ordering used here is based on the Space Fill- 
ing Curve (SFC) method described in [9j (Hilbert atom 
ordering) , and shown in Figure |5J involving also a sec- 
ond tier of ordering in the product space. The first or- 
dering maps atoms that are close in Cartesian-space to 
entries close in the index space of the matrix. The sec- 
ond ordering is a natural consequence of the SpAMM 
multiply, which recursively maps out a multi-level oc- 
tree (see also Figure [T]) with cuboid coordinates that are 
equivalent to a spatial-hash; sorting the hash produces 
a three-dimensional SFC in the product space [2TH23] . 
This two-tiered structure is novel, providing an encom- 
passing scheme that parlays the locality of physical inter- 
actions into the data locality of matrices (element clus- 
tering), and into the irregular product space. This ap- 
proach is applicable to other problems with underlying 
decay properties, in which the finite-elements, radial ba- 
sis functions, etc. replace atoms, or graph theoretical 
methods (nested dissection, Cuthill-McKee, etc.) replace 



the first-tier SFC altogether. In either case, the corre- 
spondence between the recursive SpAMM product space 
and a three-dimensional SFC provides a tool to exploit 
both spatial and temporal locality in the distribution of 
work and data. 



RESULTS AND DISCUSSION 
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Figure 3: Decay of normed density matrix atom-blocks 
|| Pab || F with Cartesian separation \R a — Rb\ for the largest 
molecules in each sequence: a 450 atom water cluster (red), 
a 752 atom (4,3) nanotube (orange) and a 780 atom (3,3) 
nanotube (blue). 

In this initial contribution, we limit ourselves to ex- 
ploring the numerical and computational behavior of 
SpAMM applied to problems in electronic structure the- 
ory, and compare its relative merits to the dropping of 
matrix elements. In O (n) electronic structure calcula- 
tions fT5UT7] . a primary source of error due to sparse 
matrix-multiplication develops from early steps in the 
iterative construction of the Heaviside matrix function 
P = 0[F — ill] (density matrix purification), which is a 
projector of the effective Hamiltonian (Fockian) F |24| . 
Starting from the basis spanning F, purification drives 
eigenvalues to 0s or Is, whereupon error accumulation 
due to an approximate linear algebra is quenched |24| . 
Under a given approximate linear algebra, the electronic 
energy E e \ = Tr(P.F) is a global measure of accumu- 
lated error. In the following Section, we carry out pu- 
rification on three molecular sequences of increasing size, 
water clusters, (4,3) nanotubes and (3,3) nanotubes. In 
each case, a Fockian F was obtained from a fully con- 
verged Self-Consistent-Field (SCF) cycle and used as ba- 
sis for the purification; our numerical experiments probe 
only errors within one density matrix solve (40-60 multi- 
plies), and do not address error propagation throughout 
the SCF cycle. As the rate of decay slows towards SCF 
convergence, these calculations represent the instance of 
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minimum decay, shown for each of the largest molecu- 
lar species in Figure [3j Within each sequence, the same 
number of iterative steps (40-60 matrix multiplications) 
are taken using the TC2 purification algorithm [25] and 
either: (I) matrix element dropping and exact multipli- 
cation or (II) SpAMM. 
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Figure 4: Average number of 4x4 matrix multiplies (MAT- 
MULs) for a sequence of (4,3) nanotubes at the RHF/STO-2G 
level of theory with dropping (red) and SpAMM (blue). 
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Figure 5: Average number of 4x4 matrix multiplies (MAT- 
MULs) for a sequence of (3,3) nanotubes at the LDA/STO- 
2G/ level of theory with dropping (red) and SpAMM (blue). 

In this study, we chose /c max to yield 4x4 blocks at 
the finest level of resolution, corresponding to the most 
aggressive use of single precision SSE on the x86 archi- 
tecture. In all cases the matrix norm employed is the 



Frobenius norm 



? . Thresholds, r, have been ad- 



justed to roughly match relative errors in the electronic 
energy, AE e \ = \E G \ — E e \\/\E G \\ between the two schemes, 
(I) and (II), and the average number of 4x4 MATMULs 
per purification step are reported. Multiplications are 
only a proxy for CPU time, as neither case accounts for 
symbolic overheads associated with the multiply (CSR, 
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Figure 6: Average number of 4x4 matrix multiplies (MAT- 
MULs) for a sequence of water clusters at the RHF/6-31G** 
level of theory with dropping (red) and SpAMM (blue) . 



recursive-tree etc.). In the case of dropping, SpAMM was 
also used in the matrix multiply but with zero threshold. 
After each multiplication in the dropping scheme, a filter 
was applied to the resultant, dropping blocks at the 4x4 
level of resolution using the criteria ||-P fe ||_p < t. Results 
for the three molecular sequences are shown in Figures [6j 
HandE 

For comparable values of AE e \, SpAMM was found to 
employ from slightly fewer multiplies for the (4,3) nan- 
otube, to dramatically less in the case of the water clus- 
ters. This result is somewhat surprising, since the spatial 
decay is slower in the case of the (4,3) nanotube than for 
the water clusters, as shown in Figure [3J While it seems 
reasonable to attribute this unexpected result to the ef- 
fects of dimensionality, further study is required to be 
sure. Its also worth noting that the advantage of SpAMM 
relative to dropping is brought down with decreasing r; 
for r = they both revert to the same 0(n 3 ) complexity. 
For both systems with exponential decay, error appears 
tightly controlled albeit within an as yet unknown bound. 
Note however, that unlike matrix truncation which leads 
to a product error that is O (t 2 ), SpAMM leads to a 
truncation error that is O (r) . 

Comparing the quasi one-dimensional metallic (3,3) 
nanotube with slow algebraic decay to the insulating 
(4,3) nanotube with exponential decay, SpAMM gains 
substantially over dropping in the case of slower decay. 
However, the ability of SpAMM to achieve a linear scal- 
ing complexity in the case of the metallic system remains 
in question, as in the case of the tightest threshold, the 
SpAMM errors do not appear to be well controlled, and 
the cost does not appear linear, at least not in this size 
regime. On the other hand, the SpAMM result does en- 
joy a significant reduction in cost relative to dropping, 
and the error increase is modest. 



5 



CONCLUSION 

The Sparse Approximate Matrix Multiply (SpAMM) 
is a fast method for matrices with decay, which is dif- 
ferent from element dropping or the H-algebra in that it 
uniquely involves truncation of the product space rather 
than the matrix space. For matrices with exponential 
or fast algebraic decay, SpAMM can achieve stable er- 
ror control comparable to element dropping, but with a 
greatly reduced number of floating point operations. The 
results presented here are preliminary, and have not yet 
explored the interesting problems of slow decay in the 
asymptotic limit, ordering, error bounds, the cost of re- 
cursion, high performance implementations or broader 
gauges of accuracy and efficiency, such as the use of 
SpAMM in the context of the Self-Consistent-Field cy- 
cle. Of particular concern is the relationship between 
complexity, matrix decay and error control. Based on 
our numerical tests, we postulate that the algorithm is 
at worst 0(n\gn) for matrices with sufficiently fast de- 
cay. 

In addition to similarities with the H-algebra, SpAMM 
falls under the rubric of the generalized TV-body problem 
[18J. From this perspective, its worth noting also the con- 
nection between matrix-matrix multiplication as A^-body 
problem and matrix-matrix multiplication as spatial join 
[26J, as well as between A/"-body problems and data base 
theory in general [23J. 

Next, we draw attention to the second tier of Space 
Filling Curve (SFC) shown in Figure |2j which pro- 
vides a mechanism for domain decomposition and load 
balance that is proven for parallel irregular problems 
[2TJ m [271 [28]. Also, the improvement gained in Ref- 
erence [29J on going from a one-dimensional to a two- 
dimensional matrix partitioning scheme for the parallel 
SpMM suggests that partitioning the three-dimensional 
product space instead may provide an even higher degree 
of flexibility and granularity. 
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